Movie Clustering Project
Background
The movie industry is a multimillion-dollar market, but not every movie that hits the box office is wildly successful. Moviegoers do not know how much they will enjoy a film when they buy a ticket; a movie’s popularity may stem from an experienced actor, a famous director, and/or a large budget. The Internet Movie Database (IMDb) publishes an overall rating for each movie on a scale from one to ten. The rating is determined not by movie critics but by IMDb users, and the average of the votes is weighted to guard against ballot stuffing. The score is widely treated as a credible measure of a movie’s success: if a movie is highly entertaining and well liked, many tickets will be sold, generating a large profit for filmmakers and actors.
Previous Research
Our dataset includes Facebook likes for the movie, its director, and its top three actors. Previous research, by contrast, used tweets from Twitter users to determine a movie’s rating (link to research). Like the IMDb score, this rating was on a scale of 1 to 10. Individual Twitter users rated between 1 and 308 different movies, and in all the dataset included 65,115 ratings across 12,425 users. Their data, however, do not account for gross profit or for notable actors and directors.
Conversely, three researchers at Princeton University determined that tweets are not an accurate measure of a movie’s rating (link to research). Their work compares the opinions of Twitter users against IMDb and Rotten Tomatoes ratings. They found that Twitter users can be characteristically different from general raters, both in their ratings and in their relative preference for Oscar-nominated versus non-nominated movies. More specifically, the authors created a measure of a movie’s “hype,” calculated from the number of tweets about the movie before its release, and showed that opinions on Twitter trend with the opinions of those in a user’s network. Lastly, the researchers determined that these data cannot correctly predict a movie’s box office success.
Two researchers at the University of Texas at Austin conducted research far more similar to ours (link to research). They used the IMDb website to test whether the quality of a movie depends on the cast and crew involved, then developed a model for predicting the ratings of movies currently in production. In the end, they found that it is possible to predict a movie’s rating from the scores of its crew members; however, their model achieved an accuracy of only 9.83 percent. The greatest source of error was that for some movies the crew members did not appear in the training set from which the model was built. The authors note that other factors, such as release date, location, and plot, may influence a movie’s success, but those variables are not easily quantified. One possible solution could be sentiment analysis on the text of the plots.
Our analysis is an extension of previous research because we use the IMDb score, which is a highly cited source, along with Facebook likes, profit, and budget. We then try to determine whether or not we can predict a movie’s success based on these variables of interest.
Question of Interest
Which factors contribute to a movie being more successful than another? In other words, what characteristics might a movie have to yield a higher IMDb score?
Data
Data Description
This movie dataset was randomly sampled from the IMDb database API and compiled by a Kaggle user. The data span 100 years and 66 different countries, covering 5,043 movies described by 28 variables. It provides particularly valuable information about each movie: the title, year of release, genre/plot keywords, duration, content rating, the cast and certain crew members, the budget and box office gross revenue, user and critic votes/reviews, social media presence, and the IMDb rating.
Load the movie data and clean the title and content-rating columns
library(stringr)  # str_trim
library(plyr)     # mapvalues
movies <- read.csv("movie_metadata.csv")
#Remove a special character from the start of each title and trim trailing white space
movies$movie_title <- gsub("Â", "", as.character(factor(movies$movie_title)))
movies$movie_title <- str_trim(movies$movie_title, side = "right")
#Remap content rating
movies$content_rating <- mapvalues(movies$content_rating,
from=c("", "Not Rated", "GP", "Approved", "Passed",
"R", "M", "X","NC-17"),
to=c("Unrated", "Unrated", "G", "PG", "G",
"Adult", "Adult", "Adult", "Adult"))
Exploratory Data Analysis
Budget vs Box Office Performance
ggplot(data=movies_new, aes(x=budget, y=gross, color=content_rating)) + geom_line() + labs(title = "Budget v Box Office Performance", x='Budget', y = "Box Office") + theme(axis.title = element_text(size= 9, face = 'italic', family = 'Verdana') )
Cast Popularity v IMDb Score
ggplot(
data = movies_new,
aes(x = cast_total_facebook_likes, y = imdb_score))+
labs(x = "Total Cast Facebook Likes", y ="IMDb Score",
title = "Cast Popularity v IMDb Score")+
geom_point() + coord_flip() +
theme(axis.title = element_text(size= 8, face = 'italic', family = 'Verdana') )
Cast Popularity v Box Office Performance
ggplot(data = movies_new, aes(x= cast_total_facebook_likes, y = gross))+
labs(x = "Cast Total Facebook Likes", y ="Box Office",
title = "Cast Popularity v Box Office")+
geom_point() + coord_flip() +
theme(axis.title = element_text(size= 8, face = 'italic', family = 'Verdana') )
K-means Clustering
Correlation Matrix on Quantitative Variables
Correlation Heatmap
ggcorr(quant_movies, label = TRUE, label_round = 2, label_size = 2.5, size = 2, hjust = .85, low = "red", mid = "white", high = "purple") +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
Choose Variables of Interest and Standardize
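The scaled matrix `movies_scaled` used by the clustering code below is not constructed in any chunk shown above. One plausible construction (a sketch, assuming `quant_movies` holds the chosen quantitative columns) is:

```r
# Drop incomplete rows, then standardize each quantitative variable to
# mean 0 and standard deviation 1 so no single scale dominates k-means.
# (Sketch: `quant_movies` is assumed to contain the selected numeric columns.)
quant_complete <- na.omit(quant_movies)
movies_scaled <- as.data.frame(scale(quant_complete))
```

Standardizing matters here because the raw variables live on wildly different scales (dollars in the hundreds of millions vs. Facebook likes in the thousands), and k-means uses Euclidean distance.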
Select the Appropriate Number of Clusters
#Use the function we created to evaluate several different numbers of clusters
explained_variance = function(data_in, k){
# Running the kmeans algorithm.
set.seed(1)
kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 30)
var_exp = kmeans_obj$betweenss / kmeans_obj$totss
var_exp
}
View Elbow Chart
#Create an elbow chart of the output
explained_var_rep = sapply(1:10, explained_variance, data_in = movies_scaled)
# Data for ggplot2.
elbow_data_rep = data.frame(k = 1:10, explained_var_rep)
ggplot(elbow_data_rep,
aes(x = k,
y = explained_var_rep)) +
geom_point(size = 4) + #<- sets the size of the data points
geom_line(size = 1) + #<- sets the thickness of the line
xlab('k') +
ylab('Inter-cluster Variance / Total Variance') +
theme_light()
Run K-means
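The fitted object `kmeans_obj_movies` and the cluster labels `clusters_movies` referenced below are produced by a call like the following (a sketch; the seed, Lloyd algorithm, and iteration cap mirror the elbow-chart helper above):

```r
set.seed(1)
# Fit k-means with the k = 3 chosen from the elbow chart.
kmeans_obj_movies <- kmeans(movies_scaled, centers = 3,
                            algorithm = "Lloyd", iter.max = 30)
# Store the cluster assignments as a factor so ggplot can map them
# to color and shape aesthetics.
clusters_movies <- as.factor(kmeans_obj_movies$cluster)
```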
Visualize Plot and Graph Results
Clustering Results on Movie Reception vs Monetary Success
ggplot(movies_scaled, aes(x = imdb_score,
                          y = gross,
                          color = clusters_movies,
                          shape = clusters_movies)) +
geom_point(size = 4) +
ggtitle("Clustering results on Movie Reception vs Monetary Success") +
xlab("IMDB Score (scaled)") +
ylab("Box Office Gross (scaled)") +
theme_bw()
Advanced Visualization of Clusters and Results
rating_colors = data.frame(content_rating = c("G", "PG", "Adult", "Unrated", "PG-13"),
color = c("red", "blue", "green", "yellow", "purple"))
rating_color_table = inner_join(movies_new, rating_colors, by = 'content_rating')
fig <- plot_ly(rating_color_table, x = ~imdb_score, y = ~gross, color = ~content_rating,
               colors = c("pink", "red", "purple", "lightblue", "blue"),
               text = ~paste('Movie:', movie_title)) %>%
  layout(title = "Voter Score vs Movie Box Office Gross by Content Rating")
fig
Quality Assessment/Evaluation
Assessing the Quality of the Clustering through Variance Measurement
# Inter-cluster variance:
num_movies = kmeans_obj_movies$betweenss
# Total variance:
denom_movies = kmeans_obj_movies$totss
# Variance accounted for by clusters.
(var_exp_movies = num_movies / denom_movies)
## [1] 0.2292543
The inter-cluster variance of our K-means clustering was quite low, at only 0.229, meaning there was not much separation between our three cluster centroids. This could be due to some of the variables themselves, such as the monetary and social-media-following metrics. While we removed the variables with high collinearity that could have impacted the clustering, we kept some variables with correlation coefficient magnitudes between 0.5 and 0.7. Our threshold was 0.7; lowering it might have improved the clustering and the inter-cluster variance, but we chose not to remove too many variables for the sake of a comprehensive input dataset.
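The collinearity screen described above can be sketched in base R as follows (an illustration, assuming `quant_movies` holds the numeric columns considered for clustering):

```r
# Flag pairs of quantitative variables whose absolute correlation exceeds
# the 0.7 threshold used in our analysis.
cor_mat <- cor(quant_movies, use = "pairwise.complete.obs")
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           r    = cor_mat[high])
```

For each flagged pair, one of the two variables was dropped before clustering.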
However, we did run the algorithm with the number of clusters recommended by the elbow method and the explained-variance results. The elbow graph did not plateau at a clear cluster value, so we experimented with k = 2 and k = 3 clusters. The gain in explained variance was almost linear, so we stuck with k = 3 clusters.
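The near-linear gain can be checked directly from the elbow data computed earlier:

```r
# Marginal gain in explained variance for each additional cluster.
# Roughly constant successive differences indicate a near-linear elbow curve,
# i.e., no single k stands out sharply.
gains <- diff(explained_var_rep)
round(gains, 3)
```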
Conclusion/Insights
In total, 10 variables were fed into the K-means clustering algorithm. We chose to run it with k = 3 clusters, guided by the elbow method, the explained variance, and the visual output of our exploratory data analysis. Some variables pertained to the monetary aspects of the movie: the box office revenue and the budget. Others captured the popularity of the movie and the fame of its cast through social media following (the number of Facebook likes of the primary stars and of the movie’s own page). When we conducted a correlation analysis to reduce collinearity, we removed the metrics that repetitively measured the likes of each actor and kept just cast_total_facebook_likes and movie_facebook_likes to encompass the movie’s social media presence and online reputation. Other variables captured the volume of opinion on the movie through the number of user and critic reviews.
From the direct clustering results, there appears to be a three-way separation across the input variables deemed to contribute to a movie’s success. We visualized the results against two of the most apparent validations of success: 1) movie reception, measured by IMDb user score, and 2) social media following, each plotted against box office performance. Both give a general idea of the movie’s reception.
For the visualization of IMDb Score vs Box Office Gross, one cluster gathers on the higher side of both the IMDb score and the box office gross. These are the movies that were well received and performed well financially. Another large cluster, apparently the largest, varies immensely in IMDb score and user reception but sits on the lower end in revenue. The third cluster is interspersed between the two, lying around the middle of the scaled IMDb score and gross ranges. Many popular films, such as Avatar, The Dark Knight, Avengers, and Titanic, fall into the category of high box office gross and high IMDb rating. These are the big blockbuster movies that were not only popular with mass audiences but also critically acclaimed. The lower-end cluster, the one with the most variety, ranges from documentaries to international films to animated pictures to romantic comedies to some action films. The composition of this cluster is diverse, and its box office gross is quite low even when the film had a good user rating.
For the visualization of Movie Facebook Likes vs Box Office Gross, the clusters were far more muddled. The correlation between these two variables is not as strong, although there is one distinct cluster that is also the most widespread. Some movies with very high Facebook followings did only moderately well at the box office, while others performed well financially without much online presence. Most movies fall near the average in both Facebook likes and box office gross.
Our second, Plotly visualization further elucidated some delineations in a movie’s success: we colored IMDb rating vs box office gross by content rating. The majority of PG and PG-13 movies performed best in box office gross but varied significantly on the user review scale, as did the G-rated films. Adult films generally had lower box office revenue even when highly rated, probably because their potential audience is restricted.
Future Work
There were some inherent drawbacks in our dataset that could be addressed in a similar analysis in the future. For example, the films date back to the early 1900s, and such older movies are naturally less documented online. In addition, it is unclear whether the box office figures for these older movies were adjusted for inflation. These two factors could have influenced where the films were clustered along Social Media Following and Movie Voter Score vs Box Office Revenue. Next time we would account for these differences either by testing older and newer movies separately or by adjusting the monetary values for inflation. We also used only Facebook likes in this dataset. Statistics from other social media channels and websites (Google reviews, Twitter followers, Instagram page likes) and other user reception measures (Rotten Tomatoes scores, Metacritic grades, etc.) could give a more comprehensive picture of the films’ online presence and reputation. Future research branching off this work could factor in other determinants of a movie’s success, such as the period in which it was released (holiday season, summer, etc.) and whether it won any awards or achieved online virality.